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CN Abstract 

Conventional multiclass conditional probability estimation often requires distributional model 
assumptions such as Fisher's discriminate analysis and logistic regression. In this paper, 
a model-free estimation method is proposed to estimate multiclass conditional probability 
through a series of conditional quantile regression functions. Specifically, each quantile re- 
gression is constructed by optimizing a regularized formulation in a reproducing kernel Hilbert 

^ space, and then the conditional class probability can be well estimated by converting the quan- 

tile regression functions to the corresponding cumulative distribution functions. The primary 
advantage of the proposed estimation method is that it does not assume any distributional 
assumption and its computation cost does not increase with the number of classes. The theo- 
retical and numerical studies demonstrate that the proposed estimation method is highly com- 

Q> petitive against other alternatives, especially when the number of classes is relatively large. 
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1 Introduction 

Conditional class probability estimation is important in statistical machine learning since it mea- 
sures the strength and confidence of the classification outcomes. In multiclass classification, a 
training sample {(xj, y^)\ % — 1, 2, . . . , n} is available with covariate Xj G W and class label G 
{1,2,..., K}, where K is the number of classes. The conditional distribution of Y given X = x 
can be completely characterized by the conditional class probability Pfc(x) = Pr(Y = fc|X = x). 
Estimation of Pfe(x) is the primary goal of this paper, which is also known as the soft classification 
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(Wu, Zhang and Liu, 2010), as opposed to the hard classification that mainly focuses on predicting 
the class labels instead of estimating conditional class probabilities. 

Conventional multiclass conditional probability estimation methods often require various dis- 
tributional model assumptions. For instances, Fisher's discriminant analysis assumes that the co- 
variates within each class follow multivariate Gaussian distributions with homogeneous or het- 
eroscedastic covariance matrices. Relaxing the Gaussian distribution assumption, the multiple 
logistic regression takes one class as baseline and assumes the logarithms of all the odds ratios are 
linear functions of the covariates. These estimation methods have been popularly used in prac- 
tice, however it is difficult to verify the distributional model assumptions and thus may lead to 
suboptimal performance when the assumptions are violated. 

To circumvent the restrictive distributional assumptions, Wang, Shen and Liu (2008) proposes 
a model-free binary conditional probability estimation method by bracketing the conditional prob- 
ability through a series of weighted binary large-margin classifiers with various weights n e (0,1). 
The method is based on the property that the consistent weighted binary large-margin classifiers 
aim at estimating sign(pi(x) — n), and hence that the small bracket (n, n') containing pi(x) can be 
obtained based on the estimated sign(pi(x) — n) for different 7r's. To extend the binary estimation 
method to multiclass classification, a number of attempts have been proposed. Hasite and Tib- 
shirani (1998) and Wu, Lin and Weng (2004) develop a pairwise coupling method by converting 
a multiclass probability estimation into estimating multiple one-vs-one binary conditional proba- 
bilities. Wu, Zhang and Liu (2010) directly extends the idea of Wang, Shen and Liu (2008) and 
designs an interesting way of assigning weights to the multiple classes, and then the estimation 
can be obtained by searching for the K- vertex polyhedron that contains Pfc(x). However, both 
methods require intensive computational cost as the number of one-vs-one binary classifications is 
proportional to K 2 and the number of K- vertex polyhedrons increases with K exponentially. 

In this paper, an efficient bracketing scheme is proposed for estimating the multiclass con- 
ditional probability via a series of conditional quantile regression functions. The key idea is that 
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estimation of the multiclass conditional probability Pfc(x) is equivalent to estimation of the cumula- 
tive distribution function Pr(Y < fc|X = x), which can be obtained through a series of estimated 
conditional quantiles of Y given X = x. The proposed estimation method is computationally 
efficient in that its computational cost does not increase with K at all. The solution surface of 
the regularized quantile regression estimation (Rosset, 2009) can further alleviate the computation 
burden. More importantly, the asymptotic property of the proposed estimation method is estab- 
lished, which shows that the proposed estimation method achieves a fast convergence rate to the 
true pfc(x). The simulation studies and real data analysis also confirm that the proposed estimation 
method is highly competitive against its alternatives. 

The rest of the paper is organized as follows. Section 2 briefly reviews the existing quan- 
tile regression techniques and presents the proposed multiclass conditional probability estimation 
method along with a tuning parameter selection criterion. Section 3 establishes the asymptotic 
convergence property of the proposed method. Section 4 examines the numerical performance 
of the proposed estimation method in both simulated examples and real applications. Section 5 
contains some discussion, and the appendix is devoted to technical proofs. 

2 Multiclass probability estimation via quantile regression 

This section first briefly reviews the existing quantile regression techniques, and then presents the 
novel model-free estimation method for multiclass conditional probability by fitting a series of 
conditional quantile regressions. 

2.1 Quantile regression 

Quantile regression (Koenker and Bassett, 1978; Koenker, 2005) aims at estimating quantiles of 
conditional distribution of response Y as functions of covariates X. It has attracted enormous inter- 
est from statistics community due to its robustness against outliers and more thorough analysis on 
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the relation between the response and the covariates. More detailed reviews on quantile regression 
can be found in Koenker (2005) and Yu, Lu and Stander (2003). 

In quantile regression, the response Y can be either continuous or discrete, and the r-th quantile 
of Y for given X = x is defined as 

/*(x) = argmin {y : Pr(Y < y|X = x) > r}. 

y 

To estimate /*(x), a check loss function is often employed, 



Pr{r) 



The check loss function is consistent in that 



rr, r > 0; 
(r — l)r, r < 0. 



/ T *(x) = argmin E(p T (Y - /(X))), (1) 
/ 

where the minimization is taken with respect to all functions, and the expectation is taken with 
respect to (X, Y). The check loss functions with various r's are displayed in Figure 1. 



Figure 1 here 

Motivated by ([I]), a regularized formulation for estimating /*(x) can be constructed as 

min y^PriVi - /r(xj)) + AJ(/ T ), (2) 

1=1 

where J 7 is a candidate function class, J(f T ) is a regularization term to prevent overfitting, and 
A > is a tuning parameter controlling the tradeoff between the model fitting and complexity. 
When the candidate function class T is set as the reproducing kernel Hilbert spaces (RKHS; Wahba 
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1990) as in Li, Liu, and Zhu (2007), @ becomes 

n X 
mm Vpr(^-/r(x i )) + -||/ r ||^ , (3) 

Jt&Hk — f £ 
1=1 

where Hk is a RKHS induced by a pre-specified kernel function K(-, •), and J(f T ) = l\\fr\\n K * s 
the associated RKHS norm. More importantly, Li, Liu, and Zhu (2007) shows that the estimated 
/ r (x) based on ^ converges to /*(x) in terms of e(f T , /*) = R(f T ) — R(f*) f° r an Y r > where 
R(f T ) = E(p T (Y - f T (X))). 

2.2 Multiclass probability estimation 

In multiclass classification with Y G {1, . . . , K}, the estimation of pfe(x) is equivalent to the 
estimation of Pr(Y < k\K = x) due to the following decomposition, 

p fc (x) = Pr(Y < k\X = x) - Pr(F < k - 1|X = x), (4) 

where Pr(F < fc|X = x) = Xlj=iPj( x ) ^ s m e conditional cumulative distribution function of Y 
given X = x. Furthermore, the estimated Pr(Y < fc|X = x) can be constructed through a series 
of quantile regression function /*(x)'s, since based on the definition of /*(x), 

Pr(F < jfc|X = x) = argmax {/ T *(x) < fc}. (5) 

Note that K is discrete and only takes values in {1, • • • , K}, and estimating discrete quantiles 
can encounter various difficulties such as discontinuity as discussed in Machado et al. (2005) and 
Chen et al. (2010). A simple treatment is to jitter the discrete response by adding some continuous 
noises. In specific, denote the jittered response Y = Y + e, where e follows a uniform distribution 
on (—0.5, 0.5) and is independent of Y, and denote /*(x) as the r-th quantile of Y given X = x. 
With jittering, Y is continuous, Pr(Y < y) is strictly increasing in y, and thus /*(x) is also 
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continuous and strictly increasing in r. More importantly, Pr(Y < k) = Pr(Y < k + 0.5), and 
/* (x) can be obtained as 

/ T *(x) = /c if e (k - 0.5, k + 0.5). 

Combining the above results, /*(x) can be explicitly connected with ^^(x) as in Lemma 1. 

Lemma 1 The r-th quantile of Y = Y + e given X = xij 

fc-i 

r -EPj( x ) fe-i k 
/;(x) = fc - 0.5 + , if r p .( x ) < r < r Pj .( x ), (6) 

where Po( x ) * s to ^ e simplicity. 

By Lemma 1, Pr(F < k + 0.5|X = x) = argmax {/*(x) < + 0.5}, and then 

r 

p fc (x) = Pr{Y < k + 0.5|X = x)- Pr{Y < k- 0.5|X = x) 

= argmax {/*(x) < k + 0.5} — argmax {/*(x) < k — 0.5}. (7) 

T T 

Therefore, estimation of Pfc(x) boils down to estimating quantile regression function /*(x) for 
various r's. Specifically, let = r < T\ < r 2 < • • • < x m _i < r m = 1 be a sequence of r's, 
and / n (x), /r 2 (x), . . ., /r m _i(x) be the estimated /*(x)'s. For simplicity, set / ro ( x ) — 0.5 and 
/ Tm (x) = if + 0.5. According to ([7]), Pfc(x) can be estimated as 

Pfe(x) = argmax {/ r (x) < A; + 0.5} — argmax {/ rj (x) < k — 0.5}. (8) 

As computational remarks, the proposed estimation method in ([8]) only requires fitting m — 1 
conditional quantile regression functions, and thus its computation cost does not increase with if at 
all. The grid points T\, . . . , r m _! can be simply set as equally spaced points on (0, 1), and more so- 
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phisticated adaptive design can be employed as well. Although the true /* (x) is strictly increasing 



tent with order of /* (x) (He, 1997), which may lead to suboptimal estimation of p/c(x) in practice. 
Some non-crossing constraints then can be enforced to improve the estimation performance (Wu 



parameter needs to be appropriately determined. 

2.3 Model tuning and solution surface 

In this section, a data adaptive model tuning method for multiclass conditional probability esti- 
mation is developed. To indicate the dependency on the tuning parameter A, denote the estimated 
conditional probability as />a( x ) = (Pi( x ) ; • • • iPk(x)) t and the quantile regression function as 
/a,t(x). The overall performance of p\(x.) in estimating p(x) = (pi(x), . . . ,p^(x)) T is evaluated 
by the generalized Kullback-Leibler (GKL) loss between p and p\, 



in r, the fitted quantile regression functions / T (x) can cross each other and thus become inconsis- 



and Liu, 2009; Liu and Wu, 2011). In addition, the estimated / Tj (x) can be obtained by solving 
Q, whose performance largely depends on the choice of tuning parameter A, therefore the tuning 




(9) 



The corresponding comparative GKL loss, after omitting pA-unrelated terms in (|9]), is 



K 



GKL c (p,p x ) = -^£(^(X)log(p fe (X))). 



k=i 



It is natural to estimate GKL c (p, p\) by its empirical version, 



K 



■n 



EGKL(px) 



= —n 



i 



^^/(F 4 = A;)logp fc (x,), 



(10) 



k=l i=l 
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where /(•) is an indicator function. However, EGKL(p x ) often underestimates GKL c (p,p x ) 
especially when the estimation model is over-complicated. 

To remedy the underestimation bias, GKL c (p,p x ) can be estimated similarly as in Wang, 
Shen and Liu (2008) by searching for the optimal correction terms for EGKL(p x ). Specifically, 
minimizing the L 2 distance between GKL c (p,p x ) and a class of candidate estimators of form 
EGKL(px) + X n -dependent penalty with X n = {x;}^ =1 yields that 

K n _ 

GKL{p,p x ) = EGKL c (p x ) +n- 1 J2J2^ v ( 1 ^ = k ^ l°g(p fc (x,))|X") + D n (p x , X"), 

k=l i=l 

where D n (p x ,X n ) = £f =1 E^" 1 E7=iPk(^i) logfe(x 4 )) - E(p k (X) log(p fc (X)))|X"). Here, 
Cov (/(Fj = k), log(p fc (xj))|X n ) evaluates the accuracy of estimating p k on X n , which is similar 
to the covariance penalty in Efron (2004) and the generalized degree of freedom in Shen and Huang 
(2006), and the term D n (p x , X n ) is a correction term adjusting the effect of random covariates X 
on prediction and needs to be estimated, c.f., Breiman and Spector (1992), and Breiman (1992). 

To construct the estimated Cov(/(Yj = k), log(pfc(xj))|X™) and D n (p x , X n ), the data pertur- 
bation technique (Wang and Shen, 2006) can be adopted. The key idea is to evaluate the general- 
ization ability of the probability estimation method by its sensitivity to the local perturbations of X 
and Y. The estimation formula can be derived via derivative estimation and approximated through 
a Monte Carlo approximation. The exact expressions are similar to (11) and (12) in Wang, Shen 
and Liu (2007) and thus omitted here. 

Note that the data perturbation technique requires fitting the quantile regression function mul- 
tiple times for various r's and A's, and thus can be computationally expensive. To further reduce 
the computation cost, the solution surface of the coefficient of /a,t(x) with respect to A and r can 
be constructed following Rosset (2009). In particular, Li et al. (2007) and Takeuchi et al. (2009) 
show that the solution path of f\ >T (x) is piecewise linear with respect to A (or r) when r (or A) is 
fixed; Rosset (2009) explores the bi-level path of regularized quantile regression and shows that 
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the solution surface of /a,t( x ) can be efficiently constructed with respect to both A and r. The 
solution surface is mapped as a piecewise linear function of t or A and the possible locations of the 
bi-level optima can be found in one run of the base algorithm. That being said, the coefficient of 
/a, t ( x ) f° r various A's and r's can be obtained at essentially the same computation cost as fitting 
one time of the base algorithm. Figure 2 displays / t ,a(x) for a fixed x as a function of A and r in 
a randomly selected replication of the simulated Example 1 . 

Figure 2 here 

3 Statistical learning theory 

This section establishes the asymptotic convergence of the proposed multiclass conditional proba- 
bility estimation method, measured by 

K K 

Hpa = \\Pk-Pk\\i = ^£|p fc (X) -p k (X)\. 

k=l k=l 

The convergence rate is quantified in terms of the tuning parameter A, the number of brackets m, 
sample size n, and the cardinality of J 7 . 

3.1 Asymptotic theory 

The following technical assumptions are made. 

Assumption 1. For any r e (0, 1), there exists f T G J 7 , such that e(/ T , /*) < s n for some 
positive sequence s n — y as n — > oo. 

This is analogous to Assumption 1 in Wang et al. (2008) and ensures that the true quantile 
regression function /* can be well approximated by J 7 . 

Assumption 2. For any r e (0, 1) and f £ J 7 , there exist constants a x > and < a < 1 such 
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that 

W,f* T )T > ai \\f - fi^. 

Assumption 2 describes the local smoothness of /(x) within the neighborhood of /*(x). Note 
that e(f,f*)=E(h T (X,Y)) with 

Mx, y) = /(/;(x) < V < /(x))(/(x) - y) + 7(/(x) < y < /;(x))(y - /(x)) 

by Lemma 4 in Li et al. (2007), so Assumption 2 is the same as Assumption A in Li et al. (2007). 

Next we measures the cardinality of T by the L 2 -metric entropy with bracketing. Given any 
e > 0, {(f^, fa), a = 1, • • • , A} is an e-bracketing function set of T if for any / G J there exists 
an a such that f l a <f< and \\f a — /" || 2 < e for all a = 1, . . . , A. The L 2 -metric entropy with 
bracketing H B (e, J 7 ) is then defined as the logarithm of the cardinality of the smallest e-bracketing 
function set of 7. Denote T{h) = {/ G J 7 : J(f) < k}, J 7 ^ = {/ G J 7 : J(/) < oo} and 
Jo = minmax{ J(f r ), 1}. 

r 

Assumption 3. For some positive constants a 2 , a 3 and a 4 , there exists some e n > such that 

sup </>(e n , fc) < a 2 n 1/2 , (11) 
fe>i 

where 0(e n , fc) = ^ D 7 H l ^ 2 (u, J r (k))du and /J = £>(e n , A, fc) = min{e,^ + (fc — 1)A J , !}■ 

Theorem 1 Suppose Assumptions 1-3 are met, and there exists T > swc/i p T (y — /(x)) < T 
/or any / G J. For p A (x) obtained as in (|#]), 

/ 4^ \ 

^Pr IIpa - Pill > + 2Km 2 aT 1 5l a < 7mK exp (-a 5 n(A J ) 2 " a ) , (12) 

\ m / 

provided that A J < <^j/2, where b\ = min{max(e 2 , s n ), 1}. 
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Corollary 1 Under the assumptions in Theorem 1, 

\\PX-P\\i = O p + m 2 a r 1 C) , m -Plli = O ^ +mV i e) , 

provided that n(X Jo) 2 ~ a — log(m) diverges as n — > oo. 

Theorem 1 and Corollary 1 provide probability and risk bounds for \\p x — p\\i- They also 
suggest the ideal m to be of order 0(5n 2a ^ 3 ), yielding the fast rate of O p (5^^) for \\p\ — p\\i. 

3.2 A theoretic example 

To illustrate the asymptotic theory, a simple theoretic example is considered. Let X be sampled 
from a uniform distribution on (0, 3) and Y G {1, 2, 3} be sampled according to Pk(x) = 0.8 if 
k - 1 < x < k and 0.1 otherwise. Let T\ = {/ : / G Hk, f(x) G (0.5, 3.5)}, where K is the 
Gaussian kernel. 

To verify Assumption 1, note that for any r G (0, 1), f*(x) is continuous in x except at x — 1 
and x = 2. For given s n , define 



f* T (k -f) + x -^l(f;(k + f ) - /;(* - f )), if x G (fc - f , fc + f ), fc = 1, 2; 
f*(x), otherwise, 



then g T (x) is a continuous function of x, and ||g T — /*||i < s n /2. Furthermore, as g T (x) is 
continuous, Steinwart (2001) shows that there exists a / T G J~i such that ||g T — / T ||i < ||<?t — 
/rlloo < «n/2. Therefore, ||/ T - /;||i < ||# r - + \\g T - / T ||i < s n . Since |p r (y - / T (a:)) - 
p T (y - r T {x))\ < \(y - f T (x)) -(y- f*(x))\ = \f T (x) - f;(x)\, then 

e(/ T , £) = E(p r (F - / T (X)) - p T (Y - ~f T {X))) < E\f T {X) - f*{X)\ = \\f T - fj, < s n . 
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To verify Assumption 2, note that e(f, f*) = E(h T (X, Y)) = E^E(h T (X, F)|X)), and 



E(h T {X,Y)\X) 

E (l(J;(X) <y< f(X))(f(X) -Y) + I(f(X) <Y< f* T (X))(Y - f(X))\X 



fix) 



P x {u){f{X)-u)du 



/•?(*) 



> 0.1 



(f(X) - u)du 



/?(*) 



o.o5|/;(x)-/(x)p 



where Px(u) = Pk(X) if k — 0.5 < u < A; + 0.5. Therefore, Assumption 2 is satisfied with a = 0.5 



and at = \/0.05. 

To verify Assumption 3, since Hb(u, F{k)) = 0(\og 2 (k/u)) (Zhou, 2002) for any given k and 
0(e n , k) is nonincreasing in D, there exist positive constants ci, c 2 , such that 



1 r l /2 D"/> 

sup0(e n ,fc) < 0(e„, 1) = — / 

fe>l ^ Ja A D 



ci\og{l/u)du < c 2 log(l/e n )/e 



2-a 



Without loss of generality, assume s n < e 2 n < 1, and then b 2 n = e 2 n . Solving (11), yields that 

^ = o((^) 1 /^),whenAJ ~e 

Finally, by Corollary 1, E\\p\ — = O (— + m 2 a^ 1 n~ 1 ^ 3 {\ogn) 2 ^ 3 ) . This implies that 
E\\p\— p\\i = 0(n _1 / 9 (logn) 2 / 9 ) when m is set as 0(n 1//9 (logn)^ 2 / 9 ). 



4 Numerical results 

This section examines the effectiveness of the proposed multiclass probability estimation method 
in simulated examples. The numerical performance of the proposed method (KQR) is compared 
against a number of popular alternatives: cumulative logistic model (CLM), baseline logistic model 
(BLM), kernel multiclass logistic regression (KMLR), classification tree (TREE) and random for- 
est (RF). For fair comparison, the kernel function used in each method is set as the Gaussian kernel 
K(zi,z 2 ) = e~H Zl ~ Z2 H 2 / 2CT2 , where the scale parameter a 2 is set as the median of pairwise Eu- 
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clidean distances within the training set. To optimize the performance of each estimation method, 
a grid search is employed to select the tuning parameter as in Section 2.3. The grid used in all 

s — 31 

examples is set as {10^^; s = 1, . . . , 61}. A more refined grid search can further improve the 
numerical performance at an increased computation cost. 

In simulated examples where the true conditional probability pfc(x)'s are known, the perfor- 
mance of each probability estimation method is measured by its distance to Pfc(x). Various distance 
measures between jjfc(x) and pfc(x) are computed based on the testing set, 



1- Norm error: erri(p A ,p) = tL^2^2 l^( x *) ~~ Pk(*t)\] 

' ' t£T k=l 
1 K 

2- Norm error: err 2 (p\,p) = t=- ^(p fe (x t ) - p k {^t)f] 

' ' t&T k=l 

GKLloss: err KL {p x ,p) = ^ ^p fc (x t )log^|^; 

I I t&T k=1 Pk{ t) 

Cross entropy error: err CE (Px) = _1 °g (Pm ( x 0) , 



1 1 t£T 

where T denotes the testing set, and \T\ is the cardinality of T. 

4.1 Simulated examples 

Five simulated examples are used for comparison, and they are generated as follows. 

Example 1. First, Y is uniformly generated from {1, 2, 3}, and then two-dimensional covariates 
X given Y = y is generated from iV(//(y),E), where //(?/) = (cos(2y7r/3), sin(2?/7r/3)) T , S = 
0.7 2 / 2 with 7 2 being the 2x2 identity matrix. The training size is 400, and the testing size is 2600. 

Example 2. First, generate x\ ~ Unif(— 3, 3), x 2 ~ Unif(— 6, 6), and then generate F ac- 
cording to Pfc(x) with k = 1,2,3. Here logit(p fc (x)) = /fc(x), where /i(x) = —x 1 + 0.1a;f — 
0.05x1 + 0.1, / 2 (x) = -0.2x1 + 0.1x 2 2 -0.2 and / 3 (x) = xi + O.lxj - 0.05x1 + 0.1. The training 
size is 300, and the testing size is 3000. In addition, as in Wu et al.(2010), BLM with quadratic 
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terms x\ and x\ (BLMQ) is also considered in this example. 

Example 3. First, generate X uniformly over the disc {x\ + x\ < 100}, and then gener- 
ate Y G {1,2,3} according to p fc (x) = 3 exp(/fc(x)) with / fc (x) = ^(^(/^(x))), where 

E exp(/jW) 

_ 2 = 1 

/ii(x) = —5-^3^1 + 5rr 2 , fo 2 (x) = — 5-\/3;ei — 5x 2 , /^(x) = 0, and $(•) and T 2 (-) are the cu- 
mulative distribution functions of the standard normal distribution and t-distribution with degrees 
of freedom 2, respectively. The training size is 600, and the testing size is 2400. Clearly, the true 
quantile regression functions can not be modeled through a simple parametric model. 

Example 4. The data are generated similarly in Example 1. First, Y is generated uniformly 
over {1, 2, 3, 4, 5}, and then conditional on Y = y, the two-dimensional covariates X are generated 
from N(fi(y), E), where /j,(y) = (cos(2y7r/5), sin(2y7r/5)) T and E = 0.7 2 / 2 . The training size is 
500, and the testing size is 4000. 

Example 5. First, Y is generated uniformly over {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, and then condi- 
tional on Y = y, the two-dimensional covariates X are generated from T(/x(y), E, df = 2), where 
T is the multivariate t distribution with fi(y) = (2 * cos(y7r/5), 2 * sin(t/7r/5)) T , E = 0.7 2 / 2 and 
degree of freedom 2. The training size is 500, and the testing size is 5000. 

Each simulated example is repeated 50 times, and the averaged test errors and the correspond- 
ing standard deviations are reported in Table 1. In Table 1, if the estimated probability is (as 
produced by TREE or RF), a small constant 10~ 2 will be added to avoid degenerate GKL losses. 

Table 1 here 

Evidently, the proposed estimation method consistently delivers competitive performance com- 
pared with other methods. It outperforms CLM, KMLR, TREE and RF in most examples, and 
yields smaller errors. The performance of BLM largely depends on whether the logistic model 
assumption is satisfied or not. It yields the smallest errors Examples 1 and 4. where the model 
assumption in BLM is satisfied, while it is outperformed by other methods in Examples 2, 3 and 5 
where the model assumption is violated. 
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4.2 Real applications 

In this section, the proposed multiclass probability estimation method is applied to the iris data 
and the white wine quality data. Both datasets are publicly available at the University of California 
Irvine Machine Learning Repository ( http .-//arc hive, ics. uci. edu/mlfy . 

The iris data has 4 continuous attributes: sepal length, sepal width, petal length, and petal 
width, and three classes: Setosa, Versicolour, and Virginica. The size of the iris dataset is 150, and 
each class has 50 observations. We randomly select 30 observations from each class and set as the 
training set, and the remaining 60 observations are used for testing. The white wine quality data 
has 11 attributes, which characterize various aspects of the white wines, and the response ranges 
from to 10 representing quality scores made by wine experts. In this section, we focus only on 
three classes with quality scores 5, 6 and 7, and a total of 4535 white wines are selected, where 
1457, 2198 and 880 white wines score 5, 6 and 7, respectively. We randomly select 100 white 
wines from each class as the training set, and the remaining 4235 white wines are used for testing. 

Note that the true conditional probability pfc(x)'s are unavailable in the real applications, there- 
fore the 1-norm error, 2-norm error and GKL loss are no longer computable, but the cross entropy 
error can still be used for comparison. In addition, we also compare the averaged misclassifica- 
tion error of each probability estimation method on the testing set, where the classification label is 
predicted as y t = argmax A .p/ c (x t ), and the misclassification error is defined as 

MCE(p x ,p) = ^ r ^ t I(y t ^y t ). 

' ' teT 

The averaged cross entropy error and the misclassification error over 50 replications are reported 
in Table 2. 

Table 2 here 

It is clear that the proposed probability estimation method delivers competitive results against 
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other competitors. It yields the smallest cross entropy error and test error in both real examples, 
except that KMLR and RL produce slightly smaller cross entropy error in the iris example. The 
error rates of CLM are not reported in Table 2 for the iris example as it does not converge in most 
replications. 

5 Conclusion 

This paper proposes a novel model-free multiclass conditional probability estimation method, 
where the estimated probabilities are constructed via a series of conditional quantile regression 
functions. The proposed method does not require any distributional model assumption, and it is 
computationally efficient as its computation cost does not increase with the number of classes. The 
asymptotic convergence rate of the proposed estimated probability is established. The effectiveness 
of the proposed method is also demonstrated in both simulated examples and real applications. 



Appendix: technical proofs 



k-1 k 



Proof of Lemma 1. When Pj( x ) < T < Pj( x )' 

j=i i=i 



k-1 



Pr(Y </ r *(x)) 



Pr(Y < k - 1) + Pr [Y = k, -0.5 < e < 




-0.5 



k-1 



k-1 



T ~ EPi( X ) 




The desired result follows immediately. 
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fc k-l 

Proof of Theorem 1. First, note that p&(x) = Ps( x ) — Ps( x ) with Po(x) = 0, and then 

s=0 s=Q 



WPk-Pkh 



k fc-1 fc fc-1 

- J2p° - J2 Ps + Hp* 

s=0 s=0 s=0 s=0 



< 



s=0 s=0 



+ 



fc-1 fc-1 



(13) 



Therefore, it suffices to bound 



Pfcfx 



fc fc 

E P« - E for an Y k. 

s=0 s=0 i 

fc fc 

Next, for simplicity, denote i\(x) = p s (x), Pt(x) = Yl Ps(x), and P fc 

s=0 s=0 

~ ^ Simple calculation yields that 



x : 



Pfc(x) - 



> 



||A-Pfc||i = E\P k (X) - P k (X) 



= E (\P k (X) - P k (X)\ ■ I(B k )j + E (\P k (X) - P k (X)\ ■ I(B c k 

< Pr(B k ) + -Pr(B c k ) < Pr(B k ) + -, 
m m 



where the first inequality follows from the fact that \P k (X) — P k (X) \ is bounded by 1. Therefore, 
bounding || P k — P k \\i boils down to bounding Pr(B k ). 

Based on the estimation method in there exists ji 6 {1, . . . , m — 1}, such that = P k (x), 
and then /^(x) < k + 0.5 and / Tji+1 (x) > k + 0.5. Let A, = {x : |/ T .(x) - /* (x)| > £}, and 
we will show the relationship B k = jx : |Pfc(x) — Pfe(x)| > ^| C \Jj=i Aj in the following 
four cases. 

Case 1. If P fc (x) - P fc (x) > £ and P fc (x) < P fc+ i(x), then P fc (x) + £ < = P fe (x) < 
Pfc + i(x). Based on Lemma 1, 



/* (x) = k + 0.5 + P fc( x ) P ^ x ) > k + 5 + p fe ( x ) _ Pfe ( x ) > fc + 5 + 1 

Pfc+i(x) m 
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which implies that f* n (x) - f Tji (x) > £ > 

Case 2. If P fe (x) - P fe (x) > ^ and P fe (x) > P fe +i(x), then by Lemma 1, /* (x) > k + 1.5 
and4(x)-4(x)>l>i 

Case3. If P fc (x)-P fc (x) > | and P fc (x) > P fc _ 1 (x)-^,thenP fc _ 1 (x)-^ < r h < P fc (x)-£ 
and Pfe_i(x) < r il+ i = r h + ± < P fc (x) - ^. Based on Lemma 1, 

£ (x) = k - 0.5 + Tjl+1 ~ Pfc ' l(x) < k - 0.5 + Pfc(x) ~" < Jfc + 0.5 - -, 
31+1 Pfc(x) - p fc (x) - m' 

which implies that / r . i+1 - > 

Case 4. If P fc (x) - P fc (x) > £ and P fe (x) < P fe _i(x) - ± then r, 1+1 < P fe _i(x) and by 
Lemma 1, f\ +i (x) < k - 1.5 and / TJi+1 (x) - ^ +1 (x) > 1 > £. 

Combining the above four cases, P& C Ujl/ A? = { x : l/r, ( x ) — /*,, ( x )l ^ m f° r some j}. 
It leads to a connection between \\P k — P k \\i and e(/ T , /*) is established in the following. 

|||A - Pk\\i > 1 + mV#*} C {Pr(P) > mVt} 

m— 1 

C {Pr( (J A,) > mV 1 ^} C |Pr(A j ) > ma^C for somej}. 

Therefore, Pr ( ||P fc - P k \\ x > £ + mV 1 ^ < E Pr ( Pr ( A j) > ma^f). In addition, 
Pr(Aj) > ma^Sl implies that ||/ T . - > ^Pr(A j ) = a^5l a . This, together with 

Assumption 2, yields that e(/ Tj ., /* ) > b\. Therefore, 

j'=i 

TO— 1 

< EPr(e(/ T3 ,/; 3 )>5 T 2 t ) <m-max{3.5exp(-a 5 n(AJ rj ) 2 - a )} 

< 3.5m exp(— a 5 n(A J ) 2 ~ a ), 
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where the second to the last inequality follow from a slightly modified version of Theorem 2 in Li 
et al. (2007) incorporating the approximation error in Assumption 1. 



Based on (13), \\p k — Pk\\i > ^ + 2m 2 a 1 1 5 2a implies that at least one of \\P k — P k \\i and 
\Pk-i — is larger than — + m 2 a : [ l 5 2a . Therefore, 

/ ak \ 

Pr \\p\-p\\i > + 2Km 2 a^5 2 n a 

\ m J 

< Pri[j{\\p k - Pk \\ 1 >^ + 2m 2 a^5l a \\ < ^Pr (\\p k - p^ > - + 2mVC) 

\k=l ^ '" ' ) k=l '" 



K •> 

< VPr(||A -P k \\i>- + mWC) + E Pr ( II A-i " -Pfc-illi > - + m 2 a^5 2 n a 
fe=i 

< 7mi^ exp(— a5n(AJo 



fe=i fc=i 

\2-cn 



References 

[1] Breiman, L. (1992). The Little Bootstrap and Other Methods for Dimensionality Selection 
in Regression: X-fixed Prediction Error, Journal of the American Statistical Association, 87, 
738-754. 

[2] Breiman, L. and Spector, P. (1992). Submodel selection and evaluation in regression - 
the X- Random Case, International Statistical Review, 3, 291-319. 

[3] Chen, J. and Lazar, N. (2010). Quantile Estimation for Discrete Data via Empirical Lik- 
ilihood, Journal of Nonparametric Statistics, 22, 237-255. 

[4] Efron, B. (2004). The Estimation of Prediction Error: Covariance Penalties and Cross- 
validation, Journal of the American Statistical Association, 99, 619-632. 

[5] Gu, C. (2002). Smoothing Spline ANOVA Models, New York: Springer- Verlag. 

[6] He, X. (1997). Quantile Curves Without Crossing, The American Statistician, 51, 186-192. 

19 



[7] Hastie, T. and Tibshirani, R. (1998). Classification by Pairwise Coupling, The Annals 
of Statistics, 26, 451-471. 

[8] Kimeldorf, G. AND Wahba, G. (1971). Some Results on Tchebycheffian Spline Func- 
tions, Journal of Mathematical Analysis and Applications, 33, 82-95. 

[9] KOENKER, R. (2005). Quantile Regression, New York: Cambridge University Press. 

[10] Koenker, R. AND Bassett, G. (1978). Regression Quantiles, Econometrica, 46, 33-50. 

[1 1] Lee, Y., Lin, Y. and Wahba, G. (2004). Mulitcategory Support Vector Machines, Theory, 
and Application to the Classification of Micoarray Data and Satellite Radiance Data, Journal 
of the American Statistical Association, 99, 67-81. 

[12] Li, Y, Liu, Y. and Zhu, J. (2007). Quantile Regression in Reproducing Kernel Hilbert 
Spaces, Journal of the American Statistical Association, 102, 255-268. 

[13] Liu, Y. and Wu, Y. (2011). Simultaneous multiple non-crossing quantile regression esti- 
mation using kernel constraints, Journal of Nonparametric Statistics, 23, 415-437. 

[14] Machado, J. and Santos Silva, J. (2005). Quantiles for Counts, Journal of American 
Statistical Association, 100, 1226-1237. 

[15] ROSSET, S. (2009). Bi-Level Path Following for Cross Validated Solution of Kernel Quantile 
Regression, Journal of Machine Learning Research, 10, 2473-2505. 

[16] Shen, X. and Huang, H. (2006). Optimal Model Assessment, Selection, and Combina- 
tion, Journal of the American Statistical Association, 101, 554-568. 

[17] Shen, X. and Wong, W. (1994). Convergence Rate of Sieve Estimates, Annals of Statis- 
tics, 22, 580-615. 



20 



[18] Takeuchi, I., Le, Q. and Sears, T. (2009). Nonparametric Conditional Desity Estima- 
tion using Piecewise-linear Solution Path of Kernel Quantile Regression. Neural Computa- 
tion, 21, 533-559. 

[19] Wahba, G. (1990) Spline Models for Observational Data, Philadelphia: SIAM. 

[20] WANG, J. AND Shen, X. (2006). Estimation of Generalization Error: Random and Fixed 
Inputs, Statistica Sinica, 16, 569-588. 

[21] Wang, J., Shen, X. and Liu, Y. (2008). Probability Estimation for Large Margin Classi- 
fiers, Biometrika, 95, 149-167. 

[22] Wu, T., Lin, C. and Weng.R. (2004). Probability Estimates for Multi-Class Classification 
by Pairwise Coupling, Journal of Machine Learning Research, 5, 975-1005. 

[23] Wu, Y., Zhang, H. and Liu, Y. (2010). Robust Model-Free Multiclass Probability Esti- 
mation, Journal of the American Statistical Association, 105, 424-436. 

[24] Wu, Y. and Liu, Y. (2009). Stepwise Multiple Quantile Regression Estimation using Non- 
crossing Constraints, Statistics and Its Interface, 2, 299-310. 

[25] Yu, K., Lu, Z. and Stander, J. (2003). Quantile Regression: Applications and Current 
Research Areas, The Statistician, 52, 331-350. 

[26] Zhou, D. (2002). The Covering Number in Learning Theory, Journal of Complexity, 18, 
739-767. 



21 




22 



ible 1: Simulated examples. Estimated means and standard errors (in parentheses) of 1-norm, 

norm, GKL loss and cross entropy error (CEE) for our new method, BLM, CLM, KMLR, TREE 
RF based on 50 replications. 

1-norm 2-norm EGKL CEE 

Example 1 

KQR 0.23(0.022) 0.04(0.004) 0.08(0.008) 0.53(0.014) 

BLM 0.06(0.025) 0.01(0.003) 0.01(0.005) 0.46(0.013) 

CLM 0.57(0.016) 0.21(0.005) 0.31(0.008) 0.76(0.013) 

KMLR 0.40(0.014) 0.07(0.005) 0.15(0.008) 0.59(0.012) 

TREE 0.25(0.026) 0.06(0.013) 0.12(0.027) 0.56(0.031) 

RF 0.25(0.020) 0.05(0.009) 0.12(0.019) 0.56(0.027) 

Example 2 

~KQR 0.26(0.022) 0.05(0.012) 0.07(0.013) 0.60(0.016) 

BLM 0.44(0.012) 0.13(0.006) 0.18(0.010) 0.72(0.016) 

BLMQ 0.11(0.026) 0.01(0.005) 0.02(0.007) 0.55(0.015) 

CLM 0.44(0.011) 0.12(0.004) 0.17(0.006) 0.71(0.011) 

KMLR 0.44(0.012) 0.13(0.006) 0.18(0.010) 0.72(0.016) 

TREE 0.36(0.037) 0.10(0.016) 0.23(0.034) 0.76(0.053) 

RF 0.30(0.016) 0.07(0.007) 0.14(0.016) 0.67(0.031) 

Example 3 

~KQR 0.28(0.013) 0.07(0.004) 0.20(0.040) 0.68(0.015) 

BLM 0.30(0.008) 0.07(0.003) 0.12(0.004) 0.68(0.015) 

CLM 0.63(0.008) 0.23(0.005) 0.35(0.006) 0.97(0.011) 

KMLR 0.56(0.016) 0.13(0.010) 0.31(0.016) 0.94(0.011) 

TREE 0.22(0.021) 0.07(0.012) 0.13(0.020) 0.69(0.029) 

RF 0.25(0.015) 0.06(0.006) 0.11(0.013) 0.67(0.023) 

Example 4 

KQR 0.32(0.015) 0.04(0.005) 0.13(0.008) 1.05(0.012) 

BLM 0.10(0.025) 0.01(0.003) 0.01(0.006) 0.93(0.012) 

CLM 0.82(0.006) 0.24(0.004) 0.50(0.008) 1.42(0.007) 

KMLR 0.48(0.013) 0.08(0.004) 0.19(0.009) 1.11(0.008) 

TREE 0.40(0.019) 0.08(0.008) 0.17(0.018) 1.09(0.020) 

RF 0.44(0.018) 0.09(0.008) 0.23(0.018) 1.15(0.023) 

Example 5 

~KQR 0.56(0.019) 0.10(0.006) 0.28(0.016) 1.58(0.023) 

BLM 0.59(0.069) 0.01(0.019) 0.45(0.046) 1.74(0.047) 

CLM 1.10(0.009) 0.27(0.004) 0.87(0.013) 2.17(0.015) 

KMLR 0.88(0.008) 0.18(0.003) 0.54(0.010) 1.84(0.010) 

TREE 0.54(0.027) 0.10(0.010) 0.24(0.026) 1.54(0.038) 

RF 0.57(0.023) 0.12(0.010) 0.31(0.024) 1.61(0.038) 
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Table 2: Real applications. Estimated means and standard errors (in parentheses) of cross entropy 
error (CEE) and misclassification error (MCE) for our new method, BLM, CLM, KMLR, TREE 
and RF base on 50 replications. 





CEE 


MCE 


Iris example 


KQR 


0.18(0.026) 


0.04(0.0250) 


BLM 


2.37(2.026) 


0.06(0.0256) 


CLM 






KMLR 


0.16(0.017) 


0.05(0.0232) 


TREE 


0.24(0.093) 


0.07(0.0227) 


RF 


0.12(0.030) 


0.06(0.0237) 


Wine quality example 


KQR 


0.93(0.013) 


0.48(0.0143) 


BLM 


0.98(0.023) 


0.51(0.0123) 


CLM 


0.97(0.009) 


0.49(0.0073) 


KMLR 


0.98(0.019) 


0.51(0.0115) 


TREE 


1.49(0.131) 


0.53(0.0176) 


RF 


0.93(0.013) 


0.48(0.0145) 
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